{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "tags": [
     "hide_input",
     "hide_output"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "pd.options.display.max_colwidth = 10009\n",
    "path = \".benchmarks/Darwin-CPython-3.11-64bit/0006_b5b7ee569dab10ff304d1123984a2f446917fe9e_20241205_124128.json\"\n",
    "\n",
    "# Load the JSON data from the file\n",
    "with open(path, 'r') as file:\n",
    "    data = json.load(file)\n",
    "\n",
    "# Extract the benchmark statistics\n",
    "benchmarks = data['benchmarks']\n",
    "\n",
    "# Create a DataFrame from the benchmark statistics\n",
    "df = pd.DataFrame([{\n",
    "    'name': benchmark['name'],\n",
    "    'rounds': benchmark['stats']['rounds'],\n",
    "    'median': benchmark['stats']['median'],\n",
    "\n",
    "    'iterations': benchmark['stats']['iterations']\n",
    "} for benchmark in benchmarks])\n",
    "\n",
    "df['comparison_type'] = df['name'].str.extract(r'\\[(.*?)-(?:duckdb|spark)\\]')[0]\n",
    "df['backend'] = df['name'].str.extract(r'-(duckdb|spark)\\]')[0]\n",
    "\n",
    "\n",
    "df['name'] = df['name'].str.replace(r'test_comparison_execution_\\w+\\[.*?\\]', '', regex=True)\n",
    "\n",
    "df = df.drop('name', axis=1)\n",
    "\n",
    "# Get exact match times for each backend\n",
    "exact_match_times = df[df['comparison_type'] == 'Exact Match'].set_index('backend')['median']\n",
    "\n",
    "# Calculate multiples using merge and division\n",
    "df['multiple_of_exact_match'] = df.apply(\n",
    "    lambda x: x['median'] / exact_match_times[x['backend']],\n",
    "    axis=1\n",
    ")\n",
    "\n",
    "df['comparison_type'] = df['comparison_type'].apply(\n",
    "    lambda x: f\"{x}*\" if x == 'Cosine Similarity Level' else x\n",
    ")\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "tags": [
     "hide_input",
     "hide_output"
    ]
   },
   "outputs": [],
   "source": [
    "import altair as alt\n",
    "\n",
    "\n",
    "\n",
    "# Create base chart function to avoid code duplication\n",
    "def create_runtime_chart(data, backend_name):\n",
    "    return alt.Chart(data).mark_bar().encode(\n",
    "        y=alt.Y('comparison_type:N',\n",
    "                sort=alt.EncodingSortField(field='median', order='ascending'),\n",
    "                title='Comparison Type'),\n",
    "        x=alt.X('median:Q',\n",
    "                title='Median Runtime (seconds)'),\n",
    "        tooltip=[\n",
    "            alt.Tooltip('comparison_type', title='Comparison'),\n",
    "            alt.Tooltip('median', title='Runtime (s)', format='.3f'),\n",
    "            alt.Tooltip('multiple_of_exact_match', title='Times slower than exact match', format='.1f'),\n",
    "            alt.Tooltip('rounds', title='Rounds'),\n",
    "        ]\n",
    "    ).properties(\n",
    "        title=f'{backend_name} Comparison Runtimes',\n",
    "        height=400\n",
    "    )\n",
    "\n",
    "# Create charts for both backends\n",
    "duckdb_df = df[df['backend'] == 'duckdb'].sort_values('median')\n",
    "spark_df = df[df['backend'] == 'spark'].sort_values('median')\n",
    "\n",
    "duckdb_chart = create_runtime_chart(duckdb_df, 'DuckDB')\n",
    "spark_chart = create_runtime_chart(spark_df, 'Spark')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comparing the execution speed of different comparisons \n",
    "\n",
    "An important determinant of Splink performance is the computational complexity of any similarity or distance measures (fuzzy matching functions) used as part of your [model config](https://moj-analytical-services.github.io/splink/topic_guides/comparisons/).\n",
    "\n",
    "For example, you may be considering using [Jaro Winkler](https://moj-analytical-services.github.io/splink/topic_guides/comparisons/comparators.html?h=levenshtein#jaro-winkler-similarity)  or [Levenshtein](https://moj-analytical-services.github.io/splink/topic_guides/comparisons/comparators.html?h=levenshtein#levenshtein-distance), and wish to know which will take longer to compute.\n",
    "\n",
    "This page contains summary statistics from performance benchmarking these functions.  The code used to generate these results can be found [here](https://github.com/moj-analytical-services/splink_speed_testing/), and raw results can be found [here](https://github.com/moj-analytical-services/splink_speed_testing/tree/main/.benchmarks/Darwin-CPython-3.11-64bit).\n",
    "\n",
    "The timings are based on making 10,000,000 comparisons of the named function.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DuckDB\n",
    "\n",
    "The following chart shows the performance of different functions in DuckDB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "tags": [
     "hide_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       "  #altair-viz-eec3aed448cf4155b11640ee22fb81a3.vega-embed {\n",
       "    width: 100%;\n",
       "    display: flex;\n",
       "  }\n",
       "\n",
       "  #altair-viz-eec3aed448cf4155b11640ee22fb81a3.vega-embed details,\n",
       "  #altair-viz-eec3aed448cf4155b11640ee22fb81a3.vega-embed details summary {\n",
       "    position: relative;\n",
       "  }\n",
       "</style>\n",
       "<div id=\"altair-viz-eec3aed448cf4155b11640ee22fb81a3\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-eec3aed448cf4155b11640ee22fb81a3\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-eec3aed448cf4155b11640ee22fb81a3\");\n",
       "    }\n",
       "\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.20.1?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      let deps = [\"vega-embed\"];\n",
       "      require(deps, displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"5.20.1\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 300, \"continuousHeight\": 300}}, \"data\": {\"name\": \"data-b01dc8647493fb52aa9c3d5321fcb3d2\"}, \"mark\": {\"type\": \"bar\"}, \"encoding\": {\"tooltip\": [{\"field\": \"comparison_type\", \"title\": \"Comparison\", \"type\": \"nominal\"}, {\"field\": \"median\", \"format\": \".3f\", \"title\": \"Runtime (s)\", \"type\": \"quantitative\"}, {\"field\": \"multiple_of_exact_match\", \"format\": \".1f\", \"title\": \"Times slower than exact match\", \"type\": \"quantitative\"}, {\"field\": \"rounds\", \"title\": \"Rounds\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"median\", \"title\": \"Median Runtime (seconds)\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"comparison_type\", \"sort\": {\"field\": \"median\", \"order\": \"ascending\"}, \"title\": \"Comparison Type\", \"type\": \"nominal\"}}, \"height\": 400, \"title\": \"DuckDB Comparison Runtimes\", \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.20.1.json\", \"datasets\": {\"data-b01dc8647493fb52aa9c3d5321fcb3d2\": [{\"rounds\": 5, \"median\": 0.021056458999737515, \"iterations\": 1, \"comparison_type\": \"Exact Match\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 1.0}, {\"rounds\": 5, \"median\": 0.02219412500016915, \"iterations\": 1, \"comparison_type\": \"Cosine Similarity Level*\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 1.0540293123571165}, {\"rounds\": 5, \"median\": 0.024377040999752353, \"iterations\": 1, \"comparison_type\": \"Absolute Date Difference Level (date)\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 1.1576989749347804}, {\"rounds\": 5, \"median\": 0.060411000000385684, \"iterations\": 1, \"comparison_type\": \"Jaccard Level\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 2.86900090851642}, {\"rounds\": 5, \"median\": 0.11403295899981458, \"iterations\": 1, \"comparison_type\": \"Distance In KM Level\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 5.415580986396435}, {\"rounds\": 5, \"median\": 0.1533886249999341, \"iterations\": 1, \"comparison_type\": \"Jaro Level\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 7.284635322674445}, {\"rounds\": 5, \"median\": 0.15666545900057827, \"iterations\": 1, \"comparison_type\": \"Jaro-Winkler Level\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 7.4402566453614645}, {\"rounds\": 5, \"median\": 0.29996850000134145, \"iterations\": 1, \"comparison_type\": \"Absolute Date Difference Level (string)\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 14.24591380749635}, {\"rounds\": 5, \"median\": 0.6126142920002167, \"iterations\": 1, \"comparison_type\": \"Levenshtein Level\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 29.093889528522027}, {\"rounds\": 5, \"median\": 0.6153485419999924, \"iterations\": 1, \"comparison_type\": \"Array Subset Level\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 29.223742795864357}, {\"rounds\": 5, \"median\": 0.6402568749999773, \"iterations\": 1, \"comparison_type\": \"Array Intersect Level\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 30.406673553609302}, {\"rounds\": 5, \"median\": 0.6663422500005254, \"iterations\": 1, \"comparison_type\": \"Raw SQL Token Frequency Product\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 31.64550364374332}, {\"rounds\": 5, \"median\": 2.774612832999992, \"iterations\": 1, \"comparison_type\": \"SQL Levenshtein for all pairs in array\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 131.77015342582433}, {\"rounds\": 5, \"median\": 4.7111222080002335, \"iterations\": 1, \"comparison_type\": \"Damerau-Levenshtein Level\", \"backend\": \"duckdb\", \"multiple_of_exact_match\": 223.73762882253666}]}}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.Chart(...)"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "duckdb_chart"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Spark\n",
    "\n",
    "The following chart shows the performance of different functions in Spark\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "tags": [
     "hide_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       "  #altair-viz-df036fc6f2344a048027df5d2e87da9b.vega-embed {\n",
       "    width: 100%;\n",
       "    display: flex;\n",
       "  }\n",
       "\n",
       "  #altair-viz-df036fc6f2344a048027df5d2e87da9b.vega-embed details,\n",
       "  #altair-viz-df036fc6f2344a048027df5d2e87da9b.vega-embed details summary {\n",
       "    position: relative;\n",
       "  }\n",
       "</style>\n",
       "<div id=\"altair-viz-df036fc6f2344a048027df5d2e87da9b\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-df036fc6f2344a048027df5d2e87da9b\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-df036fc6f2344a048027df5d2e87da9b\");\n",
       "    }\n",
       "\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.20.1?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      let deps = [\"vega-embed\"];\n",
       "      require(deps, displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"5.20.1\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 300, \"continuousHeight\": 300}}, \"data\": {\"name\": \"data-31c46d027045b902496fd297243e48fb\"}, \"mark\": {\"type\": \"bar\"}, \"encoding\": {\"tooltip\": [{\"field\": \"comparison_type\", \"title\": \"Comparison\", \"type\": \"nominal\"}, {\"field\": \"median\", \"format\": \".3f\", \"title\": \"Runtime (s)\", \"type\": \"quantitative\"}, {\"field\": \"multiple_of_exact_match\", \"format\": \".1f\", \"title\": \"Times slower than exact match\", \"type\": \"quantitative\"}, {\"field\": \"rounds\", \"title\": \"Rounds\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"median\", \"title\": \"Median Runtime (seconds)\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"comparison_type\", \"sort\": {\"field\": \"median\", \"order\": \"ascending\"}, \"title\": \"Comparison Type\", \"type\": \"nominal\"}}, \"height\": 400, \"title\": \"Spark Comparison Runtimes\", \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.20.1.json\", \"datasets\": {\"data-31c46d027045b902496fd297243e48fb\": [{\"rounds\": 5, \"median\": 0.13899920900075813, \"iterations\": 1, \"comparison_type\": \"Exact Match\", \"backend\": \"spark\", \"multiple_of_exact_match\": 1.0}, {\"rounds\": 5, \"median\": 0.161567749999449, \"iterations\": 1, \"comparison_type\": \"Absolute Date Difference Level (date)\", \"backend\": \"spark\", \"multiple_of_exact_match\": 1.162364528265537}, {\"rounds\": 5, \"median\": 0.4531747080000059, \"iterations\": 1, \"comparison_type\": \"Array Subset Level\", \"backend\": \"spark\", \"multiple_of_exact_match\": 3.2602682508612997}, {\"rounds\": 5, \"median\": 0.4993427919998794, \"iterations\": 1, \"comparison_type\": \"Array Intersect Level\", \"backend\": \"spark\", \"multiple_of_exact_match\": 3.5924146301951683}, {\"rounds\": 5, \"median\": 0.5383288749999338, \"iterations\": 1, \"comparison_type\": \"Distance In KM Level\", \"backend\": \"spark\", \"multiple_of_exact_match\": 3.8728916435560263}, {\"rounds\": 5, \"median\": 0.804795791998913, \"iterations\": 1, \"comparison_type\": \"Jaro-Winkler Level\", \"backend\": \"spark\", \"multiple_of_exact_match\": 5.789930732588007}, {\"rounds\": 5, \"median\": 1.220977999999377, \"iterations\": 1, \"comparison_type\": \"Levenshtein Level\", \"backend\": \"spark\", \"multiple_of_exact_match\": 8.784064375450637}, {\"rounds\": 5, \"median\": 1.479715290999593, \"iterations\": 1, \"comparison_type\": \"Jaccard Level\", \"backend\": \"spark\", \"multiple_of_exact_match\": 10.64549432789594}, {\"rounds\": 5, \"median\": 1.6129342909989646, \"iterations\": 1, \"comparison_type\": \"Absolute Date Difference Level (string)\", \"backend\": \"spark\", \"multiple_of_exact_match\": 11.603909853833537}, {\"rounds\": 5, \"median\": 2.331908291998843, \"iterations\": 1, \"comparison_type\": \"Jaro Level\", \"backend\": \"spark\", \"multiple_of_exact_match\": 16.776414116040936}, {\"rounds\": 5, \"median\": 14.258774750000157, \"iterations\": 1, \"comparison_type\": \"Damerau-Levenshtein Level\", \"backend\": \"spark\", \"multiple_of_exact_match\": 102.58169706507026}]}}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.Chart(...)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark_chart"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Caveats and notes\n",
    "\n",
    "These charts are intended to provide a rough, high level guide to performance.  Real world performance can be sensitive to a number of factors:\n",
    "\n",
    "- For some functions such as Levenshtein, a longer input string will take longer to compute.\n",
    "- For some functions, it may be simpler to compute the result when comparing two similar strings\n",
    "- For the cosine similarity function, we used an embeddings length of 10.  This is far lower than many typical applications e.g. OpenAI's can have a length of [1,536](https://openai.com/index/new-embedding-models-and-api-updates/).  The reason was that we were running out of memory (RAM) for longer lengths, causing spill to disk, which in turn prevented the test being a pure test of the function itself.\n",
    "\n",
    "If you wish to run your own benchmarks, head over to the [splink_speed_testing](https://github.com/moj-analytical-services/splink_speed_testing/) repo, create tests [like these](https://github.com/moj-analytical-services/splink_speed_testing/blob/main/benchmarks/test_comparison_levels.py) and then run using the command \n",
    "\n",
    "```bash\n",
    "pytest benchmarks/\n",
    "```\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
